May 26, 2024
By: Chase

Use Babashka to download a landing page

Introduction

Recently, I got a junior job on Upwork. The client wants me to copy some HTML templates, make a few changes such as replacing images and text, and then submit the HTML, CSS, and image files, with no scripts.

I manually copied and pasted a few and thought:

I'm a fucking software engineer, or at least an engineer, not a textile factory worker. Writing a script to do this is the right way.

I am very familiar with JavaScript, so using Node.js with Cheerio and htmlparser2 would have been easier.

But I had seen a job posting that wanted Babashka for API queries, and I knew a little about it from Babashka Babooka.

So I decided to write the script with my limited Clojure knowledge.

Goals

  1. Create basic files, like:
./demo
├── css
│   ├── style.css
│   └── ...other stylesheets that need to be downloaded
├── imgs
│   ├── img1.png
│   ├── img2.png
│   └── ...imgs
├── index.html
└── original.html
  2. Download the HTML to original.html unmodified, just for reference.
  3. Download the HTML to index.html, which the following steps modify.
  4. Remove all the script elements.
  5. Remove all the link elements (optional).
  6. Move the content of all style elements into a new file, style.css.
  7. Download all the images into the imgs directory, and change each src to a relative path like ./imgs/img1.png.

Preparation

  1. An environment for running Babashka.
  2. Choose the dependencies I might need. Here's the code:
  (require
   '[babashka.pods :as pods]
   '[babashka.http-client :as http]
   '[babashka.fs :as fs]
   '[clojure.string :as str])
  
  (pods/load-pod 'retrogradeorbit/bootleg "0.1.9")
  
  (require '[pod.retrogradeorbit.bootleg.utils :as bootleg]
           '[pod.retrogradeorbit.hickory.select :as s])

Dependencies

babashka.http-client, babashka.fs, and clojure.string are easy to understand; you can guess what each one does from its name.

babashka.pods loads pods, which make extra libraries available in Babashka.

I need bootleg to convert the HTML string to hickory; it's an HTML parser. The select namespace supplies selector functions for picking out the elements I want, though you can also just walk the hickory content recursively.
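
For readers who have not seen hickory before, here is a rough sketch of the data shape: each element is a map with :type, :tag, :attrs, and :content keys, and text nodes are plain strings. The sample below is hand-written for illustration (the URL and structure are assumptions, not output captured from bootleg), and real output will usually also contain a document root and whitespace strings.

```clojure
;; Hand-written sample in hickory's shape (an assumption for illustration,
;; not captured output): elements are maps, text nodes are strings.
(def sample-page
  {:type :element
   :tag :html
   :attrs nil
   :content [{:type :element
              :tag :body
              :attrs nil
              :content [{:type :element
                         :tag :img
                         :attrs {:src "https://example.com/a/img1.png"}
                         :content nil}
                        "Hello"]}]})

;; Plain map/vector functions are enough to navigate it.
(def first-img (get-in sample-page [:content 0 :content 0]))

(println (:tag first-img))                   ; the element's tag keyword
(println (get-in first-img [:attrs :src]))   ; the image URL
```

Because it is ordinary Clojure data, the rest of the script can use assoc, update, and mapv on it without any special API.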

Development

Let's finish the goals one by one.

Create default files

(defn- create-default-files [dir-name]
  (let [css-dir             (str dir-name "/css")
        imgs-dir            (str dir-name "/imgs")
        basic-css-file      (str dir-name "/css/style.css")
        original-html-file  (str dir-name "/original.html")
        submit-html-file    (str dir-name "/index.html")]
    (when (fs/exists? dir-name)
      (fs/delete-tree dir-name))
    (fs/create-dir dir-name)
    (fs/create-dir css-dir)
    (fs/create-dir imgs-dir)
    (fs/create-file basic-css-file)
    (fs/create-file original-html-file)
    (fs/create-file submit-html-file)))

Download html

(defn- fetch-url [url]
  (let [response (http/get url)]
    (:body response)))
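
If you want a clearer error when a page cannot be fetched, one option is to check the status yourself. The helper below is a hypothetical sketch, not part of the original script: response->body is a made-up name, and it works on any map with :status and :body keys. As far as I know, babashka.http-client throws on error statuses by default and returns the response map instead when you pass {:throw false}; the wiring in the comment block assumes that behavior.

```clojure
(defn response->body
  "Hypothetical helper: return the :body of a response map, or throw an
  ex-info carrying the url and status when the status is not 2xx."
  [url response]
  (let [status (:status response)]
    (if (and (integer? status) (<= 200 status 299))
      (:body response)
      (throw (ex-info (str "Request failed for " url)
                      {:url url :status status})))))

;; Assumed wiring with babashka.http-client (not evaluated here):
(comment
  (require '[babashka.http-client :as http])
  (response->body url (http/get url {:throw false})))
```

Keeping the check in a pure function makes it easy to try in the REPL with a hand-made map such as {:status 404 :body ""}.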

Download and write

Test this in your REPL or inside a comment block:

(def testname "test")
(create-default-files testname)
(fs/write-lines 
  (str testname "/original.html")
  [(fetch-url "https://digitalwebrocket.com/rocketpack/")])

The HTML content has now been downloaded to test/original.html.

Handle the html

  1. Convert the HTML to a hickory map like the one in Figure 0.

  2. Walk the map recursively and apply the handling each goal needs:

  (defn recur-hick [data]
    (if (map? data)
      (let [content (:content data)]
        (assoc data
               :content
               ;; handle the content here
               (mapv recur-hick content)))
      (do
        (prn data)
        data)))
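
To see the recursion work without downloading anything, here is a minimal, self-contained sketch of the same idea. The names walk-hickory and clear-script and the sample tree are made up for illustration; the tree is hand-written, not real bootleg output.

```clojure
(defn walk-hickory
  "Apply f to every element map in the tree, parents before children.
  Strings (text nodes) are returned unchanged."
  [f node]
  (if (map? node)
    (let [handled (f node)]
      (assoc handled :content (mapv #(walk-hickory f %) (:content handled))))
    node))

;; Hand-written sample tree (illustrative only).
(def tree
  {:type :element :tag :body :attrs nil
   :content [{:type :element :tag :script :attrs nil :content ["alert(1)"]}
             {:type :element :tag :p :attrs nil :content ["hi"]}]})

(defn clear-script
  "Hypothetical handler: empty the content of script elements."
  [node]
  (if (= :script (:tag node))
    (assoc node :content [])
    node))

(def cleaned (walk-hickory clear-script tree))

(prn (mapv :content (:content cleaned)))
;; prints [[] ["hi"]]
```

Passing the handler in as an argument keeps the walk itself generic, so each goal can be tested as its own small function.
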
  3. Write the handler functions:
  (defn- download-img
    "Download the image at url into imgs-dir."
    [url imgs-dir]
    (let [img-name (last (str/split url #"/"))
          img-path (str imgs-dir "/" img-name)]
      (prn "downloading..." url)
      (try
        (fs/write-bytes
         img-path
         (-> url
             (http/get {:as :bytes})
             :body))
        ;; report the failure and keep going so one bad image doesn't stop the run
        (catch Exception e
          (prn "error when download-img" url (ex-message e))))))
 
  (defn handle-html-element [data]
    (let [{:keys [tag attrs content]} data]
      (case tag
        :img (do
               (download-img (get attrs :src) "test/imgs")
               ;; return the element itself, not the result of the download
               data)
        :style (do
                 ;; write the content to style.css
                 (fs/write-lines "test/css/style.css" content {:append true})
                 (assoc data :content []))
        :script (assoc data :content [])
        ;; add the link to the css file
        :head   (update data :content
                        #(conj % {:type :element
                                  :tag :link
                                  :attrs {:rel "stylesheet"
                                          :href "./css/style.css"}}))
        data)))
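
The :script branch above empties the element's content, so an empty script element still ends up in the output. If you want the elements gone entirely, one possible approach is to filter them out of their parent's :content. This is a sketch under that assumption; remove-tags and the sample map are hypothetical names, not part of the original script.

```clojure
(defn remove-tags
  "Return node with every descendant element whose :tag is in the set `tags`
  removed. Text nodes (strings) are kept as they are."
  [tags node]
  (if (map? node)
    (update node :content
            (fn [children]
              (->> children
                   (remove #(and (map? %) (contains? tags (:tag %))))
                   (mapv #(remove-tags tags %)))))
    node))

;; Hand-written sample (illustrative only).
(def page
  {:type :element :tag :head :attrs nil
   :content [{:type :element :tag :script :attrs {:src "a.js"} :content []}
             {:type :element :tag :link :attrs {:rel "stylesheet"} :content nil}
             {:type :element :tag :title :attrs nil :content ["Demo"]}]})

(prn (mapv :tag (:content (remove-tags #{:script :link} page))))
;; prints [:title]
```

Note that this only removes descendants; if the root node itself has a tag in the set, it is returned with its children filtered.
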
  4. Add a main function that receives the script arguments:
(defn main []
  (let [[url & args]        *command-line-args*
        ;; take dir-name from the second argument, or fall back to the last segment of the url
        dir-name            (or (first args)
                                (-> url
                                    (str/split #"/")
                                    last))]
    (println (str "url = " url ", dir-name = " dir-name))))
(main)

You can now run the script like this:

bb ./download_html.clj "https://digitalwebrocket.com/rocketpack/"

=> url = https://digitalwebrocket.com/rocketpack/, dir-name = rocketpack

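The directory-name fallback is easy to check in isolation. Below is a small sketch with a hypothetical helper, dir-name-from, that mirrors the logic in main: use the explicit name when one is given, otherwise the last piece of the URL after splitting on "/".

```clojure
(require '[clojure.string :as str])

(defn dir-name-from
  "Hypothetical helper: return explicit-name when it is given, otherwise the
  last piece of the url after splitting on \"/\"."
  [url explicit-name]
  (or explicit-name
      (last (str/split url #"/"))))

(println (dir-name-from "https://digitalwebrocket.com/rocketpack/" nil))
;; prints rocketpack
(println (dir-name-from "https://digitalwebrocket.com/rocketpack/" "demo"))
;; prints demo
```

The trailing slash is harmless because str/split drops trailing empty strings. For a bare domain with no path, this fallback returns the host name.
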
Final Code

(ns download-html
  (:require
   [babashka.pods :as pods]
   [babashka.http-client :as http]
   [babashka.fs :as fs]
   [clojure.string :as str]))

(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")

(require '[pod.retrogradeorbit.bootleg.utils :as bootleg]
         '[pod.retrogradeorbit.hickory.select :as s])

(defn- fetch-url
  "http request to get the html content"
  [url]
  (let [response (http/get url)]
    (:body response)))

(defn- download-img
  "download the image at url into imgs-dir"
  [url imgs-dir]
  (let [img-name (last (str/split url #"/"))
        img-path (str imgs-dir "/" img-name)]
    (prn "downloading..." url "to" img-path)
    (try
      (fs/write-bytes
       img-path
       (-> url
           (http/get {:as :bytes})
           :body))
      (catch Exception e (prn "error when download-img")))))

(defn- fix-url
  "if the url begins with '//', prepend 'https:'"
  [url]
  (if (str/starts-with? url "//")
    (str "https:" url)
    url))

(defn- handle-html-element
  "handle the img/style/script/head element"
  [data imgs-dir css-file-path]
  (let [{:keys [tag attrs type content]} data]
    (case tag
      :img (let [url      (:src attrs)
                 new-attr (-> data
                              (get :attrs)
                              (dissoc :srcset)
                              (assoc :src
                                     (str "./imgs/"
                                          (last (str/split url #"/")))))]
             ;; some paths begin with '//' and have no 'https:' prefix
             (download-img (fix-url url) imgs-dir)

             ;; remove srcset and change src to a relative path
             (assoc data :attrs new-attr))
      :style (do
               (when (vector? content)
                 (fs/write-lines css-file-path content {:append true}))
               (assoc data :content []))
      :script (-> data
                  (assoc :content [])
                  (assoc :attrs nil))
      :head (update data :content
                    #(conj % {:type :element
                              :tag :link
                              :attrs {:rel "stylesheet"
                                      :href "./css/style.css"}}))
      data)))

(defn- recur-hick
  "walk the hickory data recursively"
  [html-data imgs-dir css-file-path]
  (if (map? html-data)
    (let [handled-data (handle-html-element html-data imgs-dir css-file-path)]
      (-> handled-data
          (assoc
           :content
           (mapv #(recur-hick % imgs-dir css-file-path)
                 (:content handled-data)))))
    (do
      #_(prn "xxx" html-data)
      html-data)))


(defn main
  "the script main function"
  []
  (let [[url & args]        *command-line-args*
        dir-name            (or (first args)
                                (-> url
                                    (str/split #"/")
                                    last))
        css-dir             (str dir-name "/css")
        imgs-dir            (str dir-name "/imgs")
        basic-css-file      (str dir-name "/css/style.css")
        original-html-file  (str dir-name "/original.html")
        submit-html-file    (str dir-name "/index.html")
        html                (-> url
                                fetch-url
                                str/trim)]
    (when (fs/exists? dir-name)
      (fs/delete-tree dir-name))
    (fs/create-dir dir-name)
    (fs/create-dir css-dir)
    (fs/create-dir imgs-dir)
    (fs/create-file basic-css-file)
    (fs/create-file original-html-file)
    (fs/create-file submit-html-file)

    (fs/write-lines original-html-file [html])

    (fs/write-lines submit-html-file
                    [(-> html
                         (bootleg/convert-to :hickory)
                         (recur-hick imgs-dir basic-css-file)
                         (bootleg/convert-to :html))])))

(main)

Figure 0

Improvement Todos

  1. Remove all the link elements
  2. Download all the CSS files, handling protocol-relative URLs (href="//...")
  3. Download all the JS files, handling protocol-relative URLs (src="//...")
  4. Download all the font files
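
For the protocol-relative URL items, one possible approach is to resolve every reference against the page URL with java.net.URI instead of special-casing "//". This is a sketch, not what the script currently does: resolve-url is a hypothetical name and the example.com URLs are placeholders.

```clojure
(import '(java.net URI))

(defn resolve-url
  "Resolve href against page-url, so protocol-relative (//host/x),
  root-relative (/x), and relative (x) references all become absolute."
  [^String page-url ^String href]
  (str (.resolve (URI. page-url) href)))

(def page-url "https://example.com/a/index.html")

(println (resolve-url page-url "//cdn.example.com/s.css"))
(println (resolve-url page-url "/css/site.css"))
(println (resolve-url page-url "img/logo.png"))
(println (resolve-url page-url "https://other.example.org/x.png"))
```

Absolute URLs pass through unchanged, so the same function can be applied to every href and src before downloading.
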
Tags: Bootleg Clojure Babashka